Abstract

Consider the problem of a controller sampling sequentially from a finite number of $N \geq 2$ populations, specified by random variables $X^i_k$, $i = 1, \ldots, N$, and $k = 1, 2, \ldots$; where $X^i_k$ denotes the outcome from population $i$ the $k^{th}$ time it is sampled. It is assumed that for each fixed $i$, $\{X^i_k\}_{k \geq 1}$ is a sequence of i.i.d. uniform random variables over some interval $[a_i, b_i]$, with the support (i.e., $a_i, b_i$) unknown to the controller. The objective is to have a policy $\pi$ for deciding, based on available data, from which of the $N$ populations to sample at any time $n = 1, 2, \ldots$ so as to maximize the expected sum of outcomes of $n$ samples or, equivalently, to minimize the regret due to lack of information about the parameters $\{a_i\}$ and $\{b_i\}$. In this paper, we present a simple UCB-type policy that is asymptotically optimal. Additionally, finite horizon regret bounds are given.

Keywords: Inflated Sample Means, Upper Confidence Bound, Multi-armed Bandits, Sequential Allocation

Introduction and Summary 


1. Main Model 

Let $\mathcal{F}$ be a known family of probability densities on $\mathbb{R}$, each with finite mean. We define $\mu(f)$ to be the expected value under density $f$, and $\operatorname{Sp}(f)$ to be the support of $f$. Consider the problem of sequentially sampling from a finite number of $N \geq 2$ populations or 'bandits', where measurements from population $i$ are specified by an i.i.d. sequence of random variables $\{X^i_k\}_{k \geq 1}$ with density $f_i \in \mathcal{F}$. We take each $f_i$ as unknown to the controller. It is convenient to define, for each $i$, $\mu_i = \mu(f_i) = \int_{\operatorname{Sp}(f_i)} x f_i(x)\,dx$, and $\mu^* = \mu^*(\{f_i\}) = \max_i \mu(f_i)$. Additionally, we take $\Delta_i = \mu^* - \mu_i \geq 0$, the discrepancy of bandit $i$.

We note, but for simplicity will not consider explicitly, that both discrete and continuous distributions can be studied when one takes $\{X^i_k\}_{k \geq 1}$ to be i.i.d. with density $f_i$, with respect to some known measure.

For any adaptive, non-anticipatory policy $\pi$, $\pi(t) = i$ indicates that the controller samples bandit $i$ at time $t$. Define $T^i_\pi(n) = \sum_{t=1}^{n} \mathbb{1}\{\pi(t) = i\}$, denoting the number of times bandit $i$ has been sampled during the periods $t = 1, \ldots, n$ under policy $\pi$; we take, as a convenience, $T^i_\pi(0) = 0$ for all $i, \pi$. The value of a policy $\pi$ is the expected sum of the first $n$ outcomes under $\pi$, which we define to be the function $V_\pi(n)$:

$$V_\pi(n) = \mathbb{E}\left[\sum_{i=1}^{N} \sum_{k=1}^{T^i_\pi(n)} X^i_k\right] = \sum_{i=1}^{N} \mu_i\, \mathbb{E}\left[T^i_\pi(n)\right], \tag{1}$$


where for simplicity the dependence of $V_\pi(n)$ on the unknown densities $\{f_i\}$ is suppressed. The regret of a policy is taken to be the expected loss due to ignorance of the underlying distributions by the controller. Had the controller complete information, she would at every round activate some bandit $i^*$ such that $\mu_{i^*} = \mu^* = \max_i \mu_i$. For a given policy $\pi$, we define the expected regret of that policy at time $n$ as

$$R_\pi(n) = n\mu^* - V_\pi(n) = \sum_{i=1}^{N} \Delta_i\, \mathbb{E}\left[T^i_\pi(n)\right]. \tag{2}$$

We are interested in policies for which $V_\pi(n)$ grows as fast as possible with $n$, or equivalently that $R_\pi(n)$ grows as slowly as possible with $n$.
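The two expressions for the regret in Eq. (2) can be checked on a small example. The following is a minimal sketch; the means and expected sample counts are made-up illustrative numbers, and the function name `expected_regret` is hypothetical.

```python
def expected_regret(means, expected_pulls):
    """Regret as the gap-weighted sum of expected sample counts, as in Eq. (2)."""
    mu_star = max(means)
    return sum((mu_star - m) * t for m, t in zip(means, expected_pulls))

# Illustrative (made-up) bandit means and expected sample counts E[T_i(n)].
means = [5.0, 4.5, 5.5]
pulls = [10.0, 5.0, 85.0]

regret = expected_regret(means, pulls)

# The same quantity written as n * mu_star minus the value V(n) of Eq. (1).
n = sum(pulls)
value = sum(m * t for m, t in zip(means, pulls))
regret_from_value = n * max(means) - value
```

Only the sub-optimal bandits contribute, which is why the analysis reduces to bounding $\mathbb{E}[T^i_\pi(n)]$ for each sub-optimal $i$.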


2. Preliminaries - Background 


We restrict $\mathcal{F}$ in the following way:

Assumption 1. Given any set of bandit densities $\{f_i\}_{i=1}^{N} \subset \mathcal{F}$, for any sub-optimal bandit $i$, i.e., $\mu(f_i) \neq \mu^*(\{f_i\})$, there exists some $\tilde{f}_i \in \mathcal{F}$ such that $\operatorname{Sp}(\tilde{f}_i) \supset \operatorname{Sp}(f_i)$, and $\mu(\tilde{f}_i) > \mu^*(\{f_i\})$.

Effectively, this ensures that at any finite time, given a set of bandits under consideration, for any bandit there is a density in $\mathcal{F}$ that would both potentially explain the measurements from that bandit, and make it the unique optimal bandit of the set.

The focus of this paper is on $\mathcal{F}$ as the set of uniform densities over some unknown support.

Let $\mathbf{I}(f, g)$ denote the Kullback-Leibler divergence of density $f$ from $g$,


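$$\mathbf{I}(f, g) = \int_{\operatorname{Sp}(f)} \ln\left(\frac{f(x)}{g(x)}\right) f(x)\,dx = \mathbb{E}_f\left[\ln\left(\frac{f(X)}{g(X)}\right)\right]. \tag{3}$$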


It is a simple generalization of a classical result (part 1 of Theorem 1) of Burnetas and Katehakis (1996b) that if a policy $\pi$ is uniformly fast (UF), i.e., $R_\pi(n) = o(n^\alpha)$ for all $\alpha > 0$ and for any choice of $\{f_i\} \subset \mathcal{F}$, then the following bound holds:

$$\liminf_{n} \frac{R_\pi(n)}{\ln n} \geq M_{BK}(\{f_i\}), \text{ for all } \{f_i\} \subset \mathcal{F}, \tag{4}$$

where the bound $M_{BK}(\{f_i\})$ itself is determined by the specific distributions of the populations:

$$M_{BK}(\{f_i\}) = \sum_{i : \mu_i \neq \mu^*} \frac{\Delta_i}{\inf_{g \in \mathcal{F}} \{\mathbf{I}(f_i, g) : \mu(g) > \mu^*\}}. \tag{5}$$


For a given set of densities $\mathcal{F}$, it is of interest to construct policies $\pi$ such that

$$\lim_{n} \frac{R_\pi(n)}{\ln n} = M_{BK}(\{f_i\}), \text{ for all } \{f_i\} \subset \mathcal{F}.$$

Such policies achieve the slowest (maximum) regret (value) growth rate possible among UF policies. They have been called UM or asymptotically optimal or efficient, cf. Burnetas and Katehakis (1996b).

For a given $f \in \mathcal{F}$, let $\hat{f}_k \in \mathcal{F}$ be an estimator of $f$ based on the first $k$ samples from $f$. It was shown in Burnetas and Katehakis (1996b) that under sufficient conditions on $\{\hat{f}_k\}$, asymptotically optimal (UM) UCB-policies could be constructed by initially sampling each bandit some number of $n_0$ times, and then for $n \geq n_0$, following an index policy:

$$\pi^0(n + 1) = \arg\max_i \left\{ u_i\left(n, T^i_{\pi^0}(n)\right) \right\}, \tag{6}$$

where the indices $u_i(n, t)$ are 'inflations of the current estimates for the means' (ISM), specified as:

$$u_i(n, t) = u_{BK}(n, t, \hat{f}^i_t) = \sup_{g \in \mathcal{F}} \left\{ \mu(g) : \mathbf{I}(\hat{f}^i_t, g) < \frac{\ln n}{t} \right\}. \tag{7}$$


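The sufficient conditions on the estimators $\{\hat{f}^i_k\}$ are as follows:

Defining

$$\mathbf{J}(f, c) = \inf_{g \in \mathcal{F}} \{\mathbf{I}(f, g) : \mu(g) > c\},$$

for all choices of $\{f_i\} \subset \mathcal{F}$ and all $\epsilon > 0$, $\delta > 0$, the following hold for each $i$, as $k \to \infty$.

C1: $\mathbb{P}\left(\mathbf{J}(\hat{f}^i_k, \mu^* - \epsilon) < \mathbf{J}(f_i, \mu^* - \epsilon) - \delta\right) = o(1/k)$.

C2: $\mathbb{P}\left(u_{BK}(k, j, \hat{f}^i_j) \leq \mu_i - \epsilon, \text{ for some } j \in \{n_0, \ldots, k\}\right) = o(1/k)$.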


These conditions correspond to Conditions A1-A3 given in Burnetas and Katehakis (1996b). However, under the stated Assumption 1 on $\mathcal{F}$ given here, Condition A1 therein is automatically satisfied. Conditions A2 (see also Remark 4(b) in Burnetas and Katehakis (1996b)) and A3 are given as C1 and C2, above, respectively. Note, Condition (C1) is essentially satisfied as long as $\hat{f}^i_k$ converges to $f_i$ (and hence $\mathbf{J}(\hat{f}^i_k, \mu^* - \epsilon) \to \mathbf{J}(f_i, \mu^* - \epsilon)$) sufficiently quickly with $k$. This can often be verified easily with standard large deviation principles. The difficulty in proving the optimality of policy $\pi^0$ is often in verifying that Condition (C2) holds.


Remark 1 The above discussion is a parameter-free variation of that in Burnetas and Katehakis (1996b), where $\mathcal{F}$ was taken to be parametrizable, i.e., $\mathcal{F} = \{f_\theta : \theta \in \Theta\}$, taking $\theta$ as a vector of parameters in some parameter space $\Theta$. Further, Burnetas and Katehakis (1996b) considered potentially different parameter spaces (and therefore potentially different parametric forms) for each bandit $i$. There, Conditions A1-A3 (hence C1, C2 herein) and the corresponding indices were stated in terms of estimates for the bandit parameters, $\hat{\theta}^i(t)$ an estimate of the parameters $\theta^i$ of bandit $i$, given $t$ samples. In particular, Eq. (7) appears essentially as

$$u_i(n, t) = u_{BK}(n, t, \hat{\theta}^i(t)) = \sup_{\theta' \in \Theta} \left\{ \mu(\theta') : \mathbf{I}(f_{\hat{\theta}^i(t)}, f_{\theta'}) < \frac{\ln n}{t} \right\}. \tag{8}$$


Previous work in this area includes Robbins (1952), and additionally Gittins (1979), Lai and Robbins (1985) and Weber (1992); there is a large literature on versions of this problem, cf. Burnetas and Katehakis (2003), Burnetas and Katehakis (1997b) and references therein. For recent work in this area we refer to Audibert et al. (2009), Auer and Ortner (2010), Gittins et al. (2011), Bubeck and Slivkins (2012), Cappé et al. (2013), Kaufmann (2015), Li et al. (2014), ?, Cowan and Katehakis (2015), and references therein. For more general dynamic programming extensions we refer to Burnetas and Katehakis (1997a), Butenko et al. (2003), Tewari and Bartlett (2008), Audibert et al. (2009), Littman (2012), Feinberg et al. (2014) and references therein. To our knowledge, outside the work in Lai and Robbins (1985), Burnetas and Katehakis (1996b) and Burnetas and Katehakis (1997a), asymptotically optimal policies have only been developed in Honda and Takemura (2013) for the problem discussed herein and in Honda and Takemura (2011) and Honda and Takemura (2010) for the problem of finite known support, where optimal policies, cyclic and randomized, that are simpler to implement than those considered in Burnetas and Katehakis (1996b) were constructed.

Other related work in this area includes: Katehakis and Derman (1986), Katehakis and Veinott Jr (1987), Burnetas and Katehakis (199?), Burnetas and Katehakis (1996a), Lagoudakis and Parr (2003), Bartlett and Tewari (2009), Tekin and Liu (2012), Jouini et al. (2009), Dayanik et al. (2013), Filippi et al. (2010), Osband and Van Roy (2014), Burnetas and Katehakis (1997a), Androulakis and Dimitrakakis (2014), Dimitrakakis (2012).


Optimal UCB Policies for Uniform Distributions 


3. The B-K Lower Bounds and Inflation Factors 


In this section we take $\mathcal{F}$ as the set of probability densities on $\mathbb{R}$ uniform over some finite interval, taking $f \in \mathcal{F}$ as uniform over $[a_f, b_f]$. Note, as the family of densities is parametrizable, this largely falls under the scope of Burnetas and Katehakis (1996b). However, the results to follow seem to demonstrate a hole in that general treatment of the problem.


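Note, some care with respect to support must be taken in applying Burnetas and Katehakis (1996b) to this case, to ensure that the integrals remain well defined. But for this $\mathcal{F}$, we have that for a given $f \in \mathcal{F}$, for any $g \in \mathcal{F}$ such that $\operatorname{Sp}(f) \subset \operatorname{Sp}(g)$, i.e., $a_g \leq a_f$ and $b_f \leq b_g$,

$$\mathbf{I}(f, g) = \mathbb{E}_f\left[\ln\left(\frac{f(X)}{g(X)}\right)\right] = \ln\left(\frac{b_g - a_g}{b_f - a_f}\right). \tag{9}$$

If $\operatorname{Sp}(f)$ is not a subset of $\operatorname{Sp}(g)$, we take $\mathbf{I}(f, g)$ as infinite.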
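As a minimal sketch of Eq. (9) and the infinite-divergence convention just stated, the divergence between two uniform densities can be computed directly from their endpoints. The function name `kl_uniform` is hypothetical and not taken from the paper.

```python
import math

def kl_uniform(a_f, b_f, a_g, b_g):
    """Kullback-Leibler divergence of Unif[a_f, b_f] from Unif[a_g, b_g].

    Finite only when the support of f is contained in the support of g,
    in which case it equals the log of the ratio of interval lengths.
    """
    if a_g <= a_f and b_f <= b_g:
        return math.log((b_g - a_g) / (b_f - a_f))
    return math.inf

same = kl_uniform(0.0, 1.0, 0.0, 1.0)        # identical densities
wider = kl_uniform(0.0, 1.0, 0.0, 2.0)       # g has twice the span of f
not_nested = kl_uniform(0.0, 2.0, 0.0, 1.0)  # support of f not inside that of g
```

The divergence depends on the two densities only through the ratio of their spans, which is what makes the optimization in the index of Eq. (7) tractable for this family.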

For notational convenience, given $\{f_i\} \subset \mathcal{F}$, for each $i$, we take $f_i \in \mathcal{F}$ as supported on some interval $[a_i, b_i]$. Note then, $\mu_i = (a_i + b_i)/2$.

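Given $t$ samples from bandit $i$, we take

$$\hat{a}^i_t = \min_{t' \leq t} X^i_{t'}, \qquad \hat{b}^i_t = \max_{t' \leq t} X^i_{t'}, \tag{10}$$

as the maximum-likelihood estimators of $a_i$ and $b_i$ respectively. We may then define $\hat{f}^i_t \in \mathcal{F}$ as the uniform density over the interval $[\hat{a}^i_t, \hat{b}^i_t]$. Note, $\hat{f}^i_t$ is the maximum-likelihood estimate of $f_i$.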

We can now state and prove the following. 


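Lemma 2 Under Assumption 1 the following are true.

$$M_{BK}(\{f_i\}) = \sum_{i : \mu_i \neq \mu^*} \frac{\Delta_i}{\ln\left(1 + \frac{2\Delta_i}{b_i - a_i}\right)}, \tag{11}$$

$$u_{BK}(n, t, \hat{f}^i_t) = \hat{a}^i_t + \frac{1}{2}\left(\hat{b}^i_t - \hat{a}^i_t\right) n^{1/t}. \tag{12}$$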
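Proof Eq. (11) follows from Eq. (5) and the observation that in this case:

$$\inf_{g \in \mathcal{F}} \{\mathbf{I}(f_i, g) : \mu(g) > \mu^*\} = \ln\left(\frac{2\mu^* - 2a_i}{b_i - a_i}\right) = \ln\left(1 + \frac{2\mu^* - 2\mu_i}{b_i - a_i}\right).$$

For Eq. (12) we have:

$$\begin{aligned}
u_{BK}(n, t, \hat{f}^i_t) &= \sup_{g \in \mathcal{F}} \left\{ \mu(g) : \mathbf{I}(\hat{f}^i_t, g) < \frac{\ln n}{t} \right\} \\
&= \sup_{a \leq \hat{a}^i_t,\ b \geq \hat{b}^i_t} \left\{ \frac{a + b}{2} : \ln\left(\frac{b - a}{\hat{b}^i_t - \hat{a}^i_t}\right) < \frac{\ln n}{t} \right\} \\
&= \sup_{a \leq \hat{a}^i_t} \left\{ a + \frac{b - a}{2} : (b - a) < \left(\hat{b}^i_t - \hat{a}^i_t\right) n^{1/t} \right\} \\
&= \hat{a}^i_t + \frac{1}{2}\left(\hat{b}^i_t - \hat{a}^i_t\right) n^{1/t}.
\end{aligned} \tag{13}$$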
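The closed form $\hat{a}^i_t + \frac{1}{2}(\hat{b}^i_t - \hat{a}^i_t) n^{1/t}$ of Eq. (12) is straightforward to evaluate from raw samples. The sketch below is illustrative only: the function name `bk_index` and the sample values are assumptions, and the final lines simply confirm that the maximizing interval meets the divergence constraint of Eq. (7) with equality.

```python
import math

def bk_index(samples, n):
    """Inflated sample mean: a_hat + (b_hat - a_hat) * n**(1/t) / 2."""
    t = len(samples)
    a_hat, b_hat = min(samples), max(samples)
    return a_hat + 0.5 * (b_hat - a_hat) * n ** (1.0 / t)

# Illustrative data: two samples from one bandit, global time n = 4.
samples = [1.0, 3.0]
n = 4
index = bk_index(samples, n)

# The maximizing density keeps a = a_hat and stretches b until the
# divergence ln((b - a) / (b_hat - a_hat)) reaches ln(n) / t.
a_hat, b_hat, t = min(samples), max(samples), len(samples)
b_stretched = a_hat + (b_hat - a_hat) * n ** (1.0 / t)
divergence = math.log((b_stretched - a_hat) / (b_hat - a_hat))
```

With few samples the inflation factor $n^{1/t}$ is large, so rarely sampled bandits receive optimistic indices; as $t$ grows the index settles toward the estimated midpoint.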


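We are interested in policies $\pi$ such that $\lim_n R_\pi(n)/\ln n$ achieves the lower bound indicated above, for every choice of $\{f_i\} \subset \mathcal{F}$. Following the prescription of Burnetas and Katehakis (1996b), i.e. Eq. (12), would lead to the following policy,

Policy BK-UCB: $\pi_{BK}$. At each $n = 1, 2, \ldots$:

i) For $n = 1, 2, \ldots, 2N$, sample each bandit twice, and

ii) for $n \geq 2N$, let $\pi_{BK}(n + 1)$ be equal to:

$$\arg\max_i \left\{ \hat{a}^i_{T^i_{\pi_{BK}}(n)} + \frac{1}{2}\left(\hat{b}^i_{T^i_{\pi_{BK}}(n)} - \hat{a}^i_{T^i_{\pi_{BK}}(n)}\right) n^{1/T^i_{\pi_{BK}}(n)} \right\}, \tag{14}$$

breaking ties arbitrarily.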

It is easy to demonstrate that the estimators $\hat{\theta}^i(t) = (\hat{a}^i_t, \hat{b}^i_t)$ converge sufficiently quickly to $(a_i, b_i)$ in probability that Condition (C1) above is satisfied for $\hat{f}^i_t$. Proving that Condition (C2) is satisfied, however, is much more difficult, and in fact we conjecture that (C2) does not hold for policy $\pi_{BK}$. While this does not indicate that $\pi_{BK}$ fails to achieve asymptotic optimality, it does imply that the standard techniques are insufficient to verify it. However, asymptotic optimality may provably be achieved by a (seemingly) negligible modification, via the following policy.


4. Asymptotically Optimal UCB Policy 


We propose the following policy: 

Policy UCB-Uniform: $\pi_{CHK}$. At each $n = 1, 2, \ldots$:

i) For $n = 1, 2, \ldots, 3N$, sample each bandit three times, and

ii) for $n \geq 3N$, let $\pi_{CHK}(n + 1)$ be equal to:

$$\arg\max_i \left\{ \hat{a}^i_{T^i_{\pi_{CHK}}(n)} + \frac{1}{2}\left(\hat{b}^i_{T^i_{\pi_{CHK}}(n)} - \hat{a}^i_{T^i_{\pi_{CHK}}(n)}\right) n^{1/\left(T^i_{\pi_{CHK}}(n) - 2\right)} \right\}, \tag{15}$$

breaking ties arbitrarily.

In the remainder of this paper, we verify the asymptotic optimality of $\pi_{CHK}$ (Theorem 4), and additionally give finite horizon bounds on the regret under this policy (Theorems 3 and 5). Further, while Theorem 5 bounds the order of the remainder term as $O((\ln n)^{3/4})$, this is refined somewhat in Theorem 7 to $o((\ln n)^{2/3 + \beta})$.
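To make the UCB-Uniform policy concrete, the following is a minimal simulation sketch. It assumes the index $\hat{a} + \frac{1}{2}(\hat{b} - \hat{a})\, n^{1/(T - 2)}$ evaluated at the current sample count $T$ of each bandit, three initial samples per bandit, and ties broken by taking the first maximizer. The function name `ucb_uniform`, the seed handling, and the example bandits are illustrative choices, not taken from the paper.

```python
import random

def ucb_uniform(bounds, horizon, seed=0):
    """Simulate the UCB-Uniform index policy on uniform bandits.

    bounds  : list of (a_i, b_i) pairs; used only to generate samples
    horizon : total number of samples to draw (at least 3 * len(bounds))
    Returns the number of times each bandit was sampled.
    """
    rng = random.Random(seed)
    n_arms = len(bounds)
    lo = [float("inf")] * n_arms    # running minima (estimates of a_i)
    hi = [float("-inf")] * n_arms   # running maxima (estimates of b_i)
    pulls = [0] * n_arms

    def sample(i):
        a, b = bounds[i]
        x = rng.uniform(a, b)
        lo[i] = min(lo[i], x)
        hi[i] = max(hi[i], x)
        pulls[i] += 1

    # Step (i): three initial samples from every bandit.
    for i in range(n_arms):
        for _ in range(3):
            sample(i)

    # Step (ii): after n samples in total, pick the bandit for time n + 1.
    for n in range(3 * n_arms, horizon):
        def index(i):
            return lo[i] + 0.5 * (hi[i] - lo[i]) * n ** (1.0 / (pulls[i] - 2))
        sample(max(range(n_arms), key=index))
    return pulls

# Illustrative run: the second bandit has the larger mean.
counts = ucb_uniform([(0.0, 1.0), (0.0, 10.0)], horizon=2000)
```

The only difference from the BK-UCB index is the exponent $1/(T - 2)$ in place of $1/T$, which is why three initial samples per bandit are needed for the exponent to be well defined.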

The Optimality Theorem and Finite Time Bounds 

For the work in this section it is convenient to define the bandit spans, $S_i = b_i - a_i$. We take $S_*$ to be the minimal span of any optimal bandit, i.e.,

$$S_* = \min_{i : \mu_i = \mu^*} S_i.$$

Recall that $\Delta_i = \mu^* - \mu_i = \max_j \{\mu_j\} - \mu_i$. The primary result of this paper is the following.

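Theorem 3 For each sub-optimal $i$ (i.e., $\mu_i \neq \mu^*$), let $(\epsilon_i, \delta_i)$ be such that $0 < \epsilon_i < S_*/2$, $0 < \delta_i < S_i$, and $\epsilon_i + \delta_i < \Delta_i$. For $\pi_{CHK}$ as defined above, for all $n \geq 3N$:

$$R_{\pi_{CHK}}(n) \leq \sum_{i : \mu_i \neq \mu^*} \left( \frac{\ln n}{\ln\left(1 + \frac{2\Delta_i}{S_i}\left(1 - \frac{\epsilon_i + \delta_i}{\Delta_i}\right)\right)} + \frac{S_i}{\delta_i} + \frac{3 S_*^3}{8 \epsilon_i^3} + 18 \right) \Delta_i. \tag{16}$$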




The proof of Theorem 3 is the central proof of this paper. We delay it briefly, to present two related results that can be derived from the above. The first is that $\pi_{CHK}$ is asymptotically optimal.


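Theorem 4 For $\pi_{CHK}$ as defined above, $\pi_{CHK}$ is asymptotically optimal in the sense that

$$\lim_{n} \frac{R_{\pi_{CHK}}(n)}{\ln n} = M_{BK}(\{f_i\}) = \sum_{i : \mu_i \neq \mu^*} \frac{\Delta_i}{\ln\left(1 + \frac{2\Delta_i}{S_i}\right)}. \tag{17}$$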


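Proof Fix the $(\epsilon_i, \delta_i)$ as feasible in the hypotheses of Theorem 3. In that case, we have

$$\limsup_{n} \frac{R_{\pi_{CHK}}(n)}{\ln n} \leq \sum_{i : \mu_i \neq \mu^*} \frac{\Delta_i}{\ln\left(1 + \frac{2\Delta_i}{S_i}\left(1 - \frac{\epsilon_i + \delta_i}{\Delta_i}\right)\right)}. \tag{18}$$

Taking the infimum as $\epsilon_i + \delta_i \to 0$ yields

$$\limsup_{n} \frac{R_{\pi_{CHK}}(n)}{\ln n} \leq \sum_{i : \mu_i \neq \mu^*} \frac{\Delta_i}{\ln\left(1 + \frac{2\Delta_i}{S_i}\right)}. \tag{19}$$

This, combined with the previous observation about the lim inf in Eq. (4), completes the result.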


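We next give an '$\epsilon$-free' version of the previous bound, which demonstrates the remainder term on the regret under $\pi_{CHK}$ is at worst $O((\ln n)^{3/4})$.

Theorem 5 For each sub-optimal $i$ (i.e., $\mu_i \neq \mu^*$), let $G_i = \min\left(\frac{1}{2} S_*, S_i, \frac{1}{4}\Delta_i\right)$. For all $n \geq 3N$,

$$\begin{aligned}
R_{\pi_{CHK}}(n) \leq{} & \sum_{i : \mu_i \neq \mu^*} \frac{\Delta_i}{\ln\left(1 + \frac{2\Delta_i}{S_i}\right)} \ln n \\
& + \sum_{i : \mu_i \neq \mu^*} \left( \frac{8 G_i \Delta_i}{(S_i + 2\Delta_i) \ln\left(1 + \frac{2\Delta_i}{S_i}\right)^2} + \frac{3 S_*^3 \Delta_i}{8 G_i^3} \right) (\ln n)^{3/4} \\
& + \sum_{i : \mu_i \neq \mu^*} \frac{S_i \Delta_i}{G_i} (\ln n)^{1/4} + 18 \sum_{i : \mu_i \neq \mu^*} \Delta_i.
\end{aligned} \tag{20}$$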


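Proof [Proof of Theorem 5] Let $0 < \epsilon < 1$, and for each $i$ let $\epsilon_i = \delta_i = G_i \epsilon$. Hence,

$$\ln\left(1 + \frac{2\Delta_i}{S_i}\left(1 - \frac{\epsilon_i + \delta_i}{\Delta_i}\right)\right) = \ln\left(1 + \frac{2\Delta_i}{S_i}\left(1 - \epsilon \frac{2 G_i}{\Delta_i}\right)\right). \tag{21}$$

Define

$$D_i = \frac{1}{\ln\left(1 + \frac{2\Delta_i}{S_i}\left(1 - \epsilon \frac{2 G_i}{\Delta_i}\right)\right)} - \frac{1}{\ln\left(1 + \frac{2\Delta_i}{S_i}\right)}. \tag{22}$$

Note the following bound, that

$$\begin{aligned}
D_i &\leq \frac{2 G_i \epsilon}{\Delta_i - 2 G_i \epsilon} \cdot \frac{2\Delta_i}{(S_i + 2\Delta_i) \ln\left(1 + \frac{2\Delta_i}{S_i}\right)^2} \\
&\leq \left(\frac{2 G_i \epsilon}{\frac{1}{2}\Delta_i}\right) \frac{2\Delta_i}{(S_i + 2\Delta_i) \ln\left(1 + \frac{2\Delta_i}{S_i}\right)^2} \\
&= \frac{8 G_i \epsilon}{(S_i + 2\Delta_i) \ln\left(1 + \frac{2\Delta_i}{S_i}\right)^2}.
\end{aligned} \tag{23}$$

This first inequality is proven separately as Proposition 9 in the Appendix. The second inequality is simply the observation that $2 G_i \epsilon < 2 G_i \leq \frac{1}{2}\Delta_i$. Applying this bound to Theorem 3 yields the following bound,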


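$$\begin{aligned}
R_{\pi_{CHK}}(n) \leq{} & \sum_{i : \mu_i \neq \mu^*} \frac{\Delta_i}{\ln\left(1 + \frac{2\Delta_i}{S_i}\right)} \ln n + \sum_{i : \mu_i \neq \mu^*} \frac{8 G_i \Delta_i}{(S_i + 2\Delta_i) \ln\left(1 + \frac{2\Delta_i}{S_i}\right)^2}\, \epsilon \ln n \\
& + \sum_{i : \mu_i \neq \mu^*} \frac{S_i \Delta_i}{G_i}\, \epsilon^{-1} + \sum_{i : \mu_i \neq \mu^*} \frac{3 S_*^3 \Delta_i}{8 G_i^3}\, \epsilon^{-3} + 18 \sum_{i : \mu_i \neq \mu^*} \Delta_i.
\end{aligned} \tag{24}$$

Taking $\epsilon = (\ln n)^{-1/4}$ completes the proof.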
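Proof [Proof of Theorem 3] For any $i$ such that $\mu_i \neq \mu^*$, recall that bandit $i$ is taken to be uniformly distributed on the interval $[a_i, b_i]$. Let $(\epsilon_i, \delta_i)$ be as hypothesized. In this proof, we take $\pi = \pi_{CHK}$ as defined above. Additionally, for each $i$ we define $W^i_k = \max_{t \leq k} X^i_t$ and $V^i_k = \min_{t \leq k} X^i_t$. We define the index function

$$u_i(k, j) = V^i_j + \frac{1}{2}\left(W^i_j - V^i_j\right) k^{1/(j - 2)}. \tag{25}$$

We define the following events of interest, $\mathcal{J}^i_t = \{u_i(t, T^i_\pi(t)) > \mu^* - \epsilon_i\}$ and $\mathcal{K}^i_t = \{V^i_{T^i_\pi(t)} < a_i + \delta_i\}$, writing $\bar{\mathcal{J}}^i_t$ and $\bar{\mathcal{K}}^i_t$ for their complements. We now define the following quantities: For $n \geq 3N$,

$$\begin{aligned}
n^i_1(n, \epsilon_i, \delta_i) &= \sum_{t = 3N}^{n} \mathbb{1}\{\pi(t + 1) = i, \mathcal{J}^i_t, \mathcal{K}^i_t\}, \\
n^i_2(n, \epsilon_i, \delta_i) &= \sum_{t = 3N}^{n} \mathbb{1}\{\pi(t + 1) = i, \mathcal{J}^i_t, \bar{\mathcal{K}}^i_t\}, \\
n^i_3(n, \epsilon_i, \delta_i) &= \sum_{t = 3N}^{n} \mathbb{1}\{\pi(t + 1) = i, \bar{\mathcal{J}}^i_t\}.
\end{aligned}$$

Hence, we have the following relationship for $n \geq 3N$, that

$$\begin{aligned}
T^i_\pi(n + 1) &= 3 + \sum_{t = 3N}^{n} \mathbb{1}\{\pi(t + 1) = i\} \\
&= 3 + n^i_1(n, \epsilon_i, \delta_i) + n^i_2(n, \epsilon_i, \delta_i) + n^i_3(n, \epsilon_i, \delta_i).
\end{aligned}$$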
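The proof proceeds by bounding, in expectation, each of the three terms.

Observe that, by the structure of the index function $u_i$,

$$\begin{aligned}
\mathbb{1}\{\pi(t + 1) = i, \mathcal{J}^i_t, \mathcal{K}^i_t\}
&\leq \mathbb{1}\left\{\pi(t + 1) = i,\ a_i + \delta_i + \frac{1}{2}(b_i - a_i)\, t^{1/(T^i_\pi(t) - 2)} > \mu^* - \epsilon_i\right\} \\
&= \mathbb{1}\left\{\pi(t + 1) = i,\ T^i_\pi(t) < \frac{\ln t}{\ln\left(\frac{2\mu^* - 2a_i - 2\epsilon_i - 2\delta_i}{b_i - a_i}\right)} + 2\right\} \\
&\leq \mathbb{1}\left\{\pi(t + 1) = i,\ T^i_\pi(t) < \frac{\ln n}{\ln\left(\frac{2\mu^* - 2a_i - 2\epsilon_i - 2\delta_i}{b_i - a_i}\right)} + 2\right\}.
\end{aligned}$$

Hence,

$$\begin{aligned}
n^i_1(n, \epsilon_i, \delta_i)
&\leq \sum_{t = 3N}^{n} \mathbb{1}\left\{\pi(t + 1) = i,\ T^i_\pi(t) < \frac{\ln n}{\ln\left(\frac{2\mu^* - 2a_i - 2\epsilon_i - 2\delta_i}{b_i - a_i}\right)} + 2\right\} \\
&\leq \frac{\ln n}{\ln\left(\frac{2\mu^* - 2a_i - 2\epsilon_i - 2\delta_i}{b_i - a_i}\right)} + 2 + 2.
\end{aligned}$$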
The last inequality follows, observing that $T^i_\pi(n)$ may be expressed as the sum of $\pi(t) = i$ indicators, and seeing that the additional condition bounds the number of non-zero terms in the above sum. The additional $+2$ simply accounts for the $\pi(1) = i$ term and the $\pi(n + 1) = i$ term.

Note, this bound is sample-path-wise.

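For the second term,

$$\begin{aligned}
n^i_2(n, \epsilon_i, \delta_i)
&\leq \sum_{t = 3N}^{n} \mathbb{1}\{\pi(t + 1) = i, \bar{\mathcal{K}}^i_t\} \\
&= \sum_{t = 3N}^{n} \sum_{k = 2}^{t} \mathbb{1}\{\pi(t + 1) = i, \bar{\mathcal{K}}^i_t, T^i_\pi(t) = k\} \\
&= \sum_{t = 3N}^{n} \sum_{k = 2}^{t} \mathbb{1}\{\pi(t + 1) = i, V^i_k \geq a_i + \delta_i, T^i_\pi(t) = k\} \\
&\leq \sum_{k = 2}^{n} \sum_{t = k}^{n} \mathbb{1}\{\pi(t + 1) = i, T^i_\pi(t) = k\}\, \mathbb{1}\{V^i_k \geq a_i + \delta_i\} \\
&\leq \sum_{k = 2}^{n} \mathbb{1}\{V^i_k \geq a_i + \delta_i\}.
\end{aligned} \tag{30}$$

The last inequality follows as, for fixed $k$, $\{\pi(t + 1) = i, T^i_\pi(t) = k\}$ may be true for at most one value of $t$. It follows then that

$$\begin{aligned}
\mathbb{E}\left[n^i_2(n, \epsilon_i, \delta_i)\right]
&\leq \sum_{k = 2}^{n} \mathbb{P}\left(V^i_k \geq a_i + \delta_i\right) \\
&= \sum_{k = 2}^{n} \mathbb{P}\left(X^i_1 \geq a_i + \delta_i\right)^k
= \sum_{k = 2}^{n} \left(1 - \frac{\delta_i}{b_i - a_i}\right)^k \leq \frac{b_i - a_i}{\delta_i}.
\end{aligned} \tag{31}$$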

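To bound the $n^i_3$ term, observe that in the event $\pi(t + 1) = i$, from the structure of the policy it must be true that $u_i(t, T^i_\pi(t)) = \max_j u_j(t, T^j_\pi(t))$. Thus, if $i^*$ is some bandit such that $\mu_{i^*} = \mu^*$, $u_{i^*}(t, T^{i^*}_\pi(t)) \leq u_i(t, T^i_\pi(t))$. In particular, we take $i^*$ to be the optimal bandit realizing the minimal span $b_{i^*} - a_{i^*}$. It follows,

$$\begin{aligned}
n^i_3(n, \epsilon_i, \delta_i)
&\leq \sum_{t = 3N}^{n} \mathbb{1}\{\pi(t + 1) = i,\ u_{i^*}(t, T^{i^*}_\pi(t)) \leq \mu^* - \epsilon_i\} \\
&\leq \sum_{t = 3N}^{n} \mathbb{1}\{u_{i^*}(t, T^{i^*}_\pi(t)) \leq \mu^* - \epsilon_i\} \\
&\leq \sum_{t = 3N}^{n} \mathbb{1}\{u_{i^*}(t, s) \leq \mu^* - \epsilon_i \text{ for some } 3 \leq s \leq t\}.
\end{aligned} \tag{32}$$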
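The last step follows as for $t$ in this range, $3 \leq T^{i^*}_\pi(t) \leq t$. Hence

$$\begin{aligned}
\mathbb{E}\left[n^i_3(n, \epsilon_i, \delta_i)\right]
&\leq \sum_{t = 3N}^{n} \mathbb{P}\left(u_{i^*}(t, s) \leq \mu^* - \epsilon_i \text{ for some } 3 \leq s \leq t\right) \\
&\leq \sum_{t = 3N}^{n} \sum_{s = 3}^{t} \mathbb{P}\left(u_{i^*}(t, s) \leq \mu^* - \epsilon_i\right).
\end{aligned} \tag{33}$$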

Here we may make use of the following result: 


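Lemma 6 Let $X_1, X_2, \ldots$ be i.i.d. $\text{Unif}[a, b]$ random variables, with $a < b$, $a$ and $b$ finite. For $k \geq 2$, let $W_k = \max_{t \leq k} X_t$ and $V_k = \min_{t \leq k} X_t$. In that case, the joint density of $(W_k, V_k)$ is given by:

$$f_k(w, v) = \begin{cases} k(k - 1)(b - a)^{-k}(w - v)^{k - 2} & \text{if } a \leq v \leq w \leq b, \\ 0 & \text{else.} \end{cases} \tag{34}$$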


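We therefore have that

$$\begin{aligned}
\mathbb{P}\left(u_{i^*}(t, s) \leq \mu^* - \epsilon_i\right)
&= \mathbb{P}\left(V^{i^*}_s + \frac{1}{2}\left(W^{i^*}_s - V^{i^*}_s\right) t^{1/(s - 2)} \leq \mu^* - \epsilon_i\right) \\
&\leq \int_{a_{i^*}}^{\mu^* - \epsilon_i} \int_{v}^{v + 2(\mu^* - \epsilon_i - v)\, t^{-1/(s - 2)}} f_s(w, v)\, dw\, dv \\
&= \frac{1}{2}\, t^{-\frac{s - 1}{s - 2}} \left(\frac{2(\mu^* - \epsilon_i) - 2a_{i^*}}{b_{i^*} - a_{i^*}}\right)^s \\
&= \frac{1}{2}\, t^{-\frac{s - 1}{s - 2}} \left(1 - \frac{2\epsilon_i}{b_{i^*} - a_{i^*}}\right)^s.
\end{aligned} \tag{35}$$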


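The last step is simply the observation that $\mu^* = (a_{i^*} + b_{i^*})/2$. For convenience, let $\alpha = 2\epsilon_i/(b_{i^*} - a_{i^*})$. We therefore have that

$$\begin{aligned}
\sum_{s = 3}^{t} \mathbb{P}\left(u_{i^*}(t, s) \leq \mu^* - \epsilon_i\right)
&\leq \frac{1}{2}\, t^{-1} \sum_{s = 3}^{t} t^{-1/(s - 2)} (1 - \alpha)^s \\
&= \frac{1}{2}\, t^{-1} \sum_{s = 1}^{t - 2} t^{-1/s} (1 - \alpha)^{s + 2} \\
&\leq \frac{1}{2}\, t^{-1} (1 - \alpha)^2 \sum_{s = 1}^{\infty} t^{-1/s} (1 - \alpha)^s.
\end{aligned} \tag{36}$$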
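The closed-form bound $\frac{1}{2}\, t^{-(s-1)/(s-2)} (1 - \alpha)^s$ on $\mathbb{P}(u_{i^*}(t, s) \leq \mu^* - \epsilon_i)$, with $\alpha = 2\epsilon_i / (b_{i^*} - a_{i^*})$, can be compared against a Monte Carlo estimate. This is a rough sketch only: the function names, the trial count, the seed, and the parameter values are illustrative assumptions.

```python
import random

def index_tail_bound(t, s, eps, a, b):
    """Closed form 0.5 * t**(-(s-1)/(s-2)) * (1 - alpha)**s, alpha = 2*eps/(b-a)."""
    alpha = 2.0 * eps / (b - a)
    return 0.5 * t ** (-(s - 1) / (s - 2)) * (1.0 - alpha) ** s

def index_tail_estimate(t, s, eps, a, b, trials=100_000, seed=1):
    """Monte Carlo estimate of P(V + (W - V) * t**(1/(s-2)) / 2 <= mu - eps)."""
    rng = random.Random(seed)
    mu = 0.5 * (a + b)
    hits = 0
    for _ in range(trials):
        xs = [rng.uniform(a, b) for _ in range(s)]
        v, w = min(xs), max(xs)
        if v + 0.5 * (w - v) * t ** (1.0 / (s - 2)) <= mu - eps:
            hits += 1
    return hits / trials

# Illustrative parameters: Unif[0, 1], s = 3 samples, time t = 6, eps = 0.05.
bound = index_tail_bound(t=6, s=3, eps=0.05, a=0.0, b=1.0)
estimate = index_tail_estimate(t=6, s=3, eps=0.05, a=0.0, b=1.0)
```

For these particular values the inner integration limit stays inside the support, so the estimate should sit close to the bound rather than well below it.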


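Hence, from Eq. (33) and the above,

$$\begin{aligned}
\mathbb{E}\left[n^i_3(n, \epsilon_i, \delta_i)\right]
&\leq \sum_{t = 3N}^{n} \frac{1}{2}\, t^{-1} (1 - \alpha)^2 \sum_{s = 1}^{\infty} t^{-1/s} (1 - \alpha)^s \\
&\leq \frac{1}{2} (1 - \alpha)^2 \sum_{t = 6}^{n} \frac{1}{t} \sum_{s = 1}^{\infty} t^{-1/s} (1 - \alpha)^s \\
&\leq (1 - \alpha)^2 \left(15 + \frac{3}{\alpha^3}\right).
\end{aligned} \tag{37}$$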
The last step is a bound proved separately as Proposition 8 in the Appendix. Observing further that $1 - \alpha < 1$, we have finally that

$$\mathbb{E}\left[n^i_3(n, \epsilon_i, \delta_i)\right] \leq 15 + \frac{3}{\alpha^3} = 15 + \frac{3(b_{i^*} - a_{i^*})^3}{8\epsilon_i^3}. \tag{38}$$

Observing that $T^i_\pi(n) \leq T^i_\pi(n + 1)$, bringing the three terms together we have that

$$\mathbb{E}\left[T^i_\pi(n)\right] \leq \frac{\ln n}{\ln\left(\frac{2\mu^* - 2a_i - 2\epsilon_i - 2\delta_i}{b_i - a_i}\right)} + \frac{b_i - a_i}{\delta_i} + \frac{3(b_{i^*} - a_{i^*})^3}{8\epsilon_i^3} + 18. \tag{39}$$

The result then follows from the definition of regret, Eq. (2), and the observation again that $\mu_i = (b_i + a_i)/2$. ∎


At various points in the results so far, choices of convenience were made with the purpose of keeping associated constants and coefficients 'nice'. The techniques and results above may actually be refined slightly to present a somewhat stronger result on the remainder term, at the cost of more complicated coefficients. In particular,

Theorem 7 For any $\beta > 0$,

$$R_{\pi_{CHK}}(n) \leq \sum_{i : \mu_i \neq \mu^*} \frac{\Delta_i \ln n}{\ln\left(1 + \frac{2\Delta_i}{S_i}\right)} + o\left((\ln n)^{2/3 + \beta}\right). \tag{40}$$


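Proof Note that, given the result of Theorem 5, it suffices to take $\beta < 1/12$.

Building on the proof of Theorem 3, taking $\alpha = 2\epsilon_i/(b_{i^*} - a_{i^*}) = 2\epsilon_i/S_*$, where $i^*$ is the optimal bandit that realizes the smallest value of $b_{i^*} - a_{i^*}$, we have that

$$\begin{aligned}
\mathbb{E}\left[T^i_\pi(n)\right]
&\leq \frac{\ln n}{\ln\left(\frac{2\mu^* - 2a_i - 2\epsilon_i - 2\delta_i}{b_i - a_i}\right)} + \frac{b_i - a_i}{\delta_i} + \frac{1}{2}(1 - \alpha)^2 \sum_{t = 6}^{n} \frac{1}{t} \sum_{s = 1}^{\infty} t^{-1/s} (1 - \alpha)^s + 3 \\
&\leq \frac{\ln n}{\ln\left(1 + \frac{2\Delta_i}{S_i}\left(1 - \frac{\epsilon_i + \delta_i}{\Delta_i}\right)\right)} + \frac{S_i}{\delta_i} + \frac{1}{2} \sum_{t = 6}^{n} \frac{1}{t} \sum_{s = 1}^{\infty} t^{-1/s} (1 - \alpha)^s + 3.
\end{aligned} \tag{41}$$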

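The proof of Theorem 3 then proceeded to bound the above double sum using Proposition 8. Utilizing the proof of Proposition 8 (but without choosing specific values of $p < 1$, $q > 1$ to render 'nice' coefficients), we have

$$\begin{aligned}
\sum_{t = 6}^{n} \frac{1}{t} \sum_{s = 1}^{\infty} t^{-1/s} (1 - \alpha)^s
&\leq \left( \left(\frac{1}{e} \frac{p + q}{1 - p}\right)^{\frac{p + q}{1 - p}} + \frac{1}{\alpha} \left(\frac{q}{e \alpha p}\right)^{\frac{q}{p}} \right) \frac{1}{q - 1} \\
&= \alpha^{-1 - \frac{q}{p}}\, C_1(p, q) + C_2(p, q) \\
&= \left(\frac{2\epsilon_i}{S_*}\right)^{-1 - \frac{q}{p}} C_1(p, q) + C_2(p, q).
\end{aligned} \tag{42}$$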
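Where for convenience we are defining $C_1, C_2$ as the associated functions of $p, q$. Note, they are finite for $p < 1$, $q > 1$. Let $0 < \epsilon < 1$ and define $G_i = \min\left(\frac{1}{2} S_*, S_i, \frac{1}{4}\Delta_i\right)$ as in Theorem 5. Taking $\epsilon_i = \delta_i = \epsilon G_i$, we have the following bound (utilizing Proposition 9 as in the proof of Theorem 5):

$$\begin{aligned}
\mathbb{E}\left[T^i_\pi(n)\right] \leq{} & \frac{\ln n}{\ln\left(1 + \frac{2\Delta_i}{S_i}\right)} + \frac{8 G_i \epsilon \ln n}{(S_i + 2\Delta_i) \ln\left(1 + \frac{2\Delta_i}{S_i}\right)^2} \\
& + \frac{S_i}{G_i}\, \epsilon^{-1} + \epsilon^{-1 - \frac{q}{p}} \left(\frac{S_*}{2 G_i}\right)^{1 + \frac{q}{p}} C_1(p, q) + C_2(p, q) + 3.
\end{aligned} \tag{43}$$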


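At this point, taking $\epsilon = (\ln n)^{-p/(2p + q)}$ yields the following

$$\begin{aligned}
\mathbb{E}\left[T^i_\pi(n)\right] \leq{} & \frac{\ln n}{\ln\left(1 + \frac{2\Delta_i}{S_i}\right)} + \frac{8 G_i (\ln n)^{\frac{p + q}{2p + q}}}{(S_i + 2\Delta_i) \ln\left(1 + \frac{2\Delta_i}{S_i}\right)^2} \\
& + \frac{S_i}{G_i} (\ln n)^{\frac{p}{2p + q}} + (\ln n)^{\frac{p + q}{2p + q}} \left(\frac{S_*}{2 G_i}\right)^{1 + \frac{q}{p}} C_1(p, q) \\
& + C_2(p, q) + 3,
\end{aligned} \tag{44}$$

or more conveniently,

$$\mathbb{E}\left[T^i_\pi(n)\right] \leq \frac{\ln n}{\ln\left(1 + \frac{2\Delta_i}{S_i}\right)} + O\left((\ln n)^{\frac{p + q}{2p + q}}\right) + O\left((\ln n)^{\frac{p}{2p + q}}\right). \tag{45}$$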


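Taking $q = \gamma p$, where $(\gamma, p)$ is chosen such that $1/\gamma < p < 1$, the above yields (via the definition of regret, Eq. (2))

$$R_\pi(n) \leq \sum_{i : \mu_i \neq \mu^*} \frac{\Delta_i \ln n}{\ln\left(1 + \frac{2\Delta_i}{S_i}\right)} + O\left((\ln n)^{\frac{1 + \gamma}{2 + \gamma}}\right) + O\left((\ln n)^{\frac{1}{2 + \gamma}}\right). \tag{46}$$

At this point, note that taking $\gamma = 2$ recovers the remainder order given in Theorem 5. For a given $1/12 > \beta > 0$, taking $\gamma < (1 + 6\beta)/(1 - 3\beta)$ yields $(1 + \gamma)/(2 + \gamma) < 2/3 + \beta$, and completes the proof. ∎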
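The exponent arithmetic in this last step can be checked exactly with rational numbers. In the sketch below the function names are hypothetical; it evaluates the remainder exponent $(1 + \gamma)/(2 + \gamma)$ and the threshold $(1 + 6\beta)/(1 - 3\beta)$ for an illustrative $\beta = 1/24$.

```python
from fractions import Fraction

def remainder_exponent(gamma):
    """Exponent (1 + gamma) / (2 + gamma) of the leading remainder term."""
    return (1 + gamma) / (2 + gamma)

def gamma_threshold(beta):
    """Value of gamma at which the exponent equals 2/3 + beta."""
    return (1 + 6 * beta) / (1 - 3 * beta)

# gamma = 2 gives the exponent 3/4.
at_two = remainder_exponent(Fraction(2))

# Illustrative beta below 1/12; any gamma in (1, threshold) then works.
beta = Fraction(1, 24)
threshold = gamma_threshold(beta)
gamma = Fraction(9, 7)
exponent = remainder_exponent(gamma)
```

Since the threshold exceeds 1 for every $\beta > 0$, there is always room to choose $\gamma > 1$ and then $p$ with $1/\gamma < p < 1$.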


5. Simulation Comparisons of the $\pi_{CK}$ Sampling


In order to obtain a picture of the benefits of the $\pi_{CK}$ sampling policy (the UCB-Uniform policy of Section 4), we compared it with the best known alternatives. In both figures below, curve ($\ell$) ($\ell = 1, 2, 3$) is a plot of the average (over 20,000 repetitions in Fig. 1 and 10,000 repetitions in Fig. 2) regret of sampling using policies $\pi_{CK}$, $\pi_{KR}$, and $\pi_{CHK}$, respectively; where policy $\pi_{KR}$ is based on the sampling policy in Katehakis and Robbins (1995), and $\pi_{CHK}$ here denotes a recently shown, cf. Cowan et al. (2015), asymptotically optimal policy for the case in which the population outcome distributions are normal with unknown means and unknown variances. Specifically, given $t$ samples from bandit $i$ at round (global time) $n$, $\pi_{CK}$, $\pi_{CHK}$ and $\pi_{KR}$ are maximum index based policies with indices $u^{CK}_i(n, t)$, $u^{CHK}_i(n, t)$, $u^{KR}_i(n, t)$, where the

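first is defined by Eq. (15) and the other two are given by: $u^{CHK}_i(n, t) = \bar{X}^i_t + S_i(t)\sqrt{n^{2/(t - 2)} - 1}$ and $u^{KR}_i(n, t) = \bar{X}^i_t + S_i(t)\sqrt{2 \ln n / t}$, where $S_i^2(t) = \sum_{k = 1}^{t} \left(X^i_k - \bar{X}^i_t\right)^2 / t$.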
Figure 1: Short Time Horizon: Numerical regret comparison of $\pi_{CK}$, $\pi_{KR}$, and $\pi_{CHK}$, for the 6 bandits with parameters given in Table 1. Average values over 20,000 repetitions.


i       1     2     3     4     5     6
$a_i$   0     0     0     1     1     1
$b_i$   10    9     8     9.5   10    5

Table 1
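For the six bandits of Table 1, the constant in the asymptotic lower bound can be evaluated directly from the formula $\sum_{i : \mu_i \neq \mu^*} \Delta_i / \ln(1 + 2\Delta_i / (b_i - a_i))$. The sketch below reads the second numeric row of the table as the upper endpoints $b_i$; the function and variable names are hypothetical.

```python
import math

# (a_i, b_i) pairs as listed in Table 1.
TABLE_1 = [(0, 10), (0, 9), (0, 8), (1, 9.5), (1, 10), (1, 5)]

def lower_bound_constant(bounds):
    """Sum over sub-optimal bandits of gap / ln(1 + 2 * gap / span)."""
    means = [(a + b) / 2 for a, b in bounds]
    mu_star = max(means)
    total = 0.0
    for (a, b), mu in zip(bounds, means):
        gap = mu_star - mu
        if gap > 0:
            total += gap / math.log(1 + 2 * gap / (b - a))
    return total

means = [(a + b) / 2 for a, b in TABLE_1]
best = means.index(max(means))           # zero-based position of the optimal bandit
m_bk = lower_bound_constant(TABLE_1)     # coefficient of ln(n) in the regret bound
```

Under this reading the optimal bandit is the fifth one, with mean 5.5, and the regret of any uniformly fast policy grows at least like `m_bk * ln(n)`.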



Figure 2: Log Time Horizon: Numerical regret comparison of $\pi_{CK}$, $\pi_{KR}$, and $\pi_{CHK}$, for the 6 bandits with parameters given in Table 1. Average values over 10,000 repetitions.


These graphs clearly illustrate the benefit of using the optimal policy. 
5.2 Additional Proofs

Proposition 8 For $0 < \alpha < 1$, for all $n \geq 3$,

$$\sum_{t = 3}^{n} \frac{1}{t} \sum_{s = 1}^{\infty} t^{-1/s} (1 - \alpha)^s \leq 30 + \frac{6}{\alpha^3}.$$


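Proof [Proof of Proposition 8] Let $1 > p > 0$. We have

$$\begin{aligned}
\sum_{s = 1}^{\infty} t^{-1/s} (1 - \alpha)^s
&\leq \sum_{s = 1}^{\lfloor \ln(t)^p \rfloor} t^{-1/s} (1 - \alpha)^s + \sum_{s = \lceil \ln(t)^p \rceil}^{\infty} t^{-1/s} (1 - \alpha)^s \\
&\leq \sum_{s = 1}^{\lfloor \ln(t)^p \rfloor} t^{-1/\lfloor \ln(t)^p \rfloor} + \sum_{s = \lceil \ln(t)^p \rceil}^{\infty} (1 - \alpha)^s \\
&\leq \lfloor \ln(t)^p \rfloor\, t^{-1/\lfloor \ln(t)^p \rfloor} + \frac{1}{\alpha} (1 - \alpha)^{\lceil \ln(t)^p \rceil} \\
&\leq \ln(t)^p\, t^{-1/\ln(t)^p} + \frac{1}{\alpha} (1 - \alpha)^{\ln(t)^p}.
\end{aligned} \tag{47}$$

Here we may make use of the following bounds, that for $x > 0$, $q > 0$,

$$x^p e^{-x^{1 - p}} \leq \left(\frac{1}{e} \frac{p + q}{1 - p}\right)^{\frac{p + q}{1 - p}} x^{-q}, \tag{48}$$

$$(1 - \alpha)^{x^p} \leq \left(\frac{-q}{e \ln(1 - \alpha)\, p}\right)^{\frac{q}{p}} x^{-q} \leq \left(\frac{q}{e \alpha p}\right)^{\frac{q}{p}} x^{-q}. \tag{49}$$

Applying these to the above,

$$\sum_{s = 1}^{\infty} t^{-1/s} (1 - \alpha)^s \leq \left( \left(\frac{1}{e} \frac{p + q}{1 - p}\right)^{\frac{p + q}{1 - p}} + \frac{1}{\alpha} \left(\frac{q}{e \alpha p}\right)^{\frac{q}{p}} \right) \ln(t)^{-q}. \tag{50}$$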
(50) 
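Hence, taking $q > 1$,

$$\begin{aligned}
\sum_{t = 3}^{n} \frac{1}{t} \sum_{s = 1}^{\infty} t^{-1/s} (1 - \alpha)^s
&\leq \left( \left(\frac{1}{e} \frac{p + q}{1 - p}\right)^{\frac{p + q}{1 - p}} + \frac{1}{\alpha} \left(\frac{q}{e \alpha p}\right)^{\frac{q}{p}} \right) \sum_{t = 3}^{n} \frac{1}{t} \ln(t)^{-q} \\
&\leq \left( \left(\frac{1}{e} \frac{p + q}{1 - p}\right)^{\frac{p + q}{1 - p}} + \frac{1}{\alpha} \left(\frac{q}{e \alpha p}\right)^{\frac{q}{p}} \right) \int_{e}^{n} \frac{1}{t} \ln(t)^{-q}\, dt \\
&= \left( \left(\frac{1}{e} \frac{p + q}{1 - p}\right)^{\frac{p + q}{1 - p}} + \frac{1}{\alpha} \left(\frac{q}{e \alpha p}\right)^{\frac{q}{p}} \right) \frac{1 - \ln(n)^{1 - q}}{q - 1} \\
&\leq \left( \left(\frac{1}{e} \frac{p + q}{1 - p}\right)^{\frac{p + q}{1 - p}} + \frac{1}{\alpha} \left(\frac{q}{e \alpha p}\right)^{\frac{q}{p}} \right) \frac{1}{q - 1}.
\end{aligned} \tag{51}$$

At this point, taking $q = 2p$ and $p = 0.55$ yields

$$\sum_{t = 3}^{n} \frac{1}{t} \sum_{s = 1}^{\infty} t^{-1/s} (1 - \alpha)^s \leq 29.9628 + \frac{5.41341}{\alpha^3}, \tag{52}$$

which, rounding up, completes the result.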

Proposition 9 For $Q > 0$, and $0 \leq \epsilon < 1$, the following bound holds:

$$\frac{1}{\ln(1 + Q(1 - \epsilon))} \leq \frac{1}{\ln(1 + Q)} + \frac{\epsilon}{1 - \epsilon} \cdot \frac{Q}{(1 + Q) \ln(1 + Q)^2}. \tag{53}$$


Proof [Proof of Proposition 9] Let $A(Q, \epsilon)$ denote the LHS of the above, $B(Q, \epsilon)$ denote the right. We adopt the physicists' convention of denoting the partial derivative of $F$ with respect to $x$ as $F_x$.

Note, $A(Q, 0) \leq B(Q, 0)$. Hence, it suffices to demonstrate that $A_\epsilon \leq B_\epsilon$ over this range or, since they are both positive,

$$\frac{A_\epsilon}{B_\epsilon} = \frac{(1 + Q)(1 - \epsilon)^2 \ln(1 + Q)^2}{(1 + Q(1 - \epsilon)) \ln(1 + Q(1 - \epsilon))^2} \leq 1. \tag{54}$$

We take, for convenience, $\delta = 1 - \epsilon$, and want to show that for $0 < \delta \leq 1$:

$$\frac{(1 + Q)\, \delta^2 \ln(1 + Q)^2}{(1 + Q\delta) \ln(1 + Q\delta)^2} \leq 1. \tag{55}$$

The above inequality holds when $\delta = 1$. Taking $C(\delta, Q)$ as the above simplified ratio, it suffices to show that $C_\delta \geq 0$. Simplifying this inequality and canceling the positive factors, it is equivalent to show that $-2Q\delta + (2 + Q\delta) \ln(1 + Q\delta) \geq 0$, or taking $x = Q\delta > 0$,

$$\ln(1 + x) \geq \frac{2x}{2 + x}. \tag{56}$$

This is a fairly standard and easily verified inequality for $\ln$. This completes the proof.
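As a numerical sanity check (not a substitute for the proof), the inequality $\frac{1}{\ln(1 + Q(1 - \epsilon))} \leq \frac{1}{\ln(1 + Q)} + \frac{\epsilon}{1 - \epsilon} \frac{Q}{(1 + Q)\ln(1 + Q)^2}$ and the logarithm inequality $\ln(1 + x) \geq 2x/(2 + x)$ can be evaluated on a grid. The grid and the function names below are illustrative choices.

```python
import math

def prop9_gap(q, eps):
    """Right-hand side minus left-hand side of the Proposition 9 inequality."""
    lhs = 1.0 / math.log1p(q * (1.0 - eps))
    rhs = (1.0 / math.log1p(q)
           + eps / (1.0 - eps) * q / ((1.0 + q) * math.log1p(q) ** 2))
    return rhs - lhs

def log_gap(x):
    """ln(1 + x) minus 2x / (2 + x); should be non-negative for x > 0."""
    return math.log1p(x) - 2.0 * x / (2.0 + x)

# Geometric grid in Q (about 0.01 to 168) and a uniform grid in eps on [0, 0.99].
qs = [0.01 * 1.5 ** k for k in range(25)]
epss = [j / 100.0 for j in range(100)]
worst_gap = min(prop9_gap(q, e) for q in qs for e in epss)
worst_log_gap = min(log_gap(x) for x in [0.01, 0.1, 1.0, 10.0, 100.0])
```

The gap vanishes at $\epsilon = 0$, where the two sides coincide, and is positive elsewhere on the grid up to floating-point rounding.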